String hashing for collection-based compression
Abstract
Data collections are traditionally stored as individually compressed files. Where the files have a significant degree of similarity, as with genomes, incremental backup archives, versioned repositories, and web archives, additional compression can be achieved using references to matching data from other files in the collection. We describe compression that exploits long-range or inter-file similarities as collection-based compression (CBC). The principal problem of CBC is locating matching data efficiently. A common technique for multiple string search uses an index of hashes, referred to as fingerprints, of strings sampled from the text; these are then compared with hashes of substrings of the search string. In this thesis, we investigate the suitability of this technique for CBC. A CBC system, cobald, was developed which employs a two-step scheme: a preliminary long-range delta encoding step using the fingerprint index, followed by compression of the delta file with a standard compression utility. Tests were performed on data collections from two sources: snapshots of web crawls (54 Gbytes) and multiple versions of a genome (26 Gbytes). Using an index of hashes of fixed-length substrings of 1,024 bytes, significantly improved compression was achieved. The web collection was compressed six times better than with gzip and three times better than with 7-zip. The genome collection was compressed ten times better than with either gzip or 7-zip. Compression took less time than 7-zip needed to compress the files individually. The use...
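The fingerprint-index idea in the abstract can be illustrated with a minimal sketch. It is not cobald's implementation: the block size of 16 bytes, the use of CRC-32 as the fingerprint, the function names, and the tuple-based delta format are all assumptions chosen for brevity (the abstract reports 1,024-byte substrings and does not name a hash function here).

```python
import zlib

BLOCK = 16  # assumed for the demo; the abstract reports 1,024-byte substrings


def build_index(reference: bytes, block: int = BLOCK) -> dict:
    """Map the fingerprint of each aligned, fixed-length block to its offset."""
    index = {}
    for off in range(0, len(reference) - block + 1, block):
        # CRC-32 is a stand-in fingerprint; keep the first offset seen.
        index.setdefault(zlib.crc32(reference[off:off + block]), off)
    return index


def delta_encode(target: bytes, reference: bytes, index: dict,
                 block: int = BLOCK) -> list:
    """Encode target as ("copy", offset, length) and ("lit", bytes) operations."""
    ops, lit, i = [], bytearray(), 0
    while i + block <= len(target):
        window = target[i:i + block]
        off = index.get(zlib.crc32(window))
        # Verify the bytes, since two different blocks can share a fingerprint.
        if off is not None and reference[off:off + block] == window:
            n = block
            # Extend the match forward as far as both inputs agree.
            while (i + n < len(target) and off + n < len(reference)
                   and target[i + n] == reference[off + n]):
                n += 1
            if lit:
                ops.append(("lit", bytes(lit)))
                lit.clear()
            ops.append(("copy", off, n))
            i += n
        else:
            lit.append(target[i])
            i += 1
    lit.extend(target[i:])  # trailing bytes shorter than one block
    if lit:
        ops.append(("lit", bytes(lit)))
    return ops


def delta_decode(ops: list, reference: bytes) -> bytes:
    """Rebuild the target from the delta operations and the reference."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += reference[op[1]:op[1] + op[2]]
        else:
            out += op[1]
    return bytes(out)


reference = bytes(range(256)) * 4
target = b"new header" + reference[100:600] + b"tail"
ops = delta_encode(target, reference, build_index(reference))
```

This sketch recomputes the hash at every target offset, which costs time proportional to the block size per position; a rolling hash would avoid that cost. In the two-step scheme the abstract describes, the resulting delta would then be passed to a standard compression utility.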
Similar resources
Compressed Image Hashing using Minimum Magnitude CSLBP
Image hashing tolerates compression, enhancement, and other signal processing operations on digital images, which are usually acceptable manipulations, whereas cryptographic hash functions are sensitive to even a single-bit change in an image. An image hash is a summary of important quality features in quantized form. In this paper, we propose a novel image hashing algorithm for authentication which i...
Image authentication using LBP-based perceptual image hashing
Feature extraction is a main step in all perceptual image hashing schemes, in which robust features lead to better perceptual robustness. Simplicity, discriminative power, computational efficiency, and robustness to illumination changes are the distinguishing properties of Local Binary Pattern features. In this paper, we investigate the use of local binary patterns for percep...
Robust Audio Hashing for Content Identification
Nowadays most audio content identification systems are based on watermarking technology. In this paper we present a different technology, referred to as robust audio hashing. By extracting robust features and translating them into a bit string, we get an object called a robust hash. Content can then be identified by comparing hash values of a received audio clip with the hash values of previous...
Approximate String Similarity Join using Hashing Techniques under Edit Distance Constraints
The string similarity join, which finds similar string pairs in sets of strings, has received extensive attention in the database and information retrieval fields. For this problem, existing research usually adopts the filter-and-refine framework, and various filtering methods have been proposed. Recently, tree based index techniques with the edit distance co...
Performance in Practice of String Hashing Functions
String hashing is a fundamental operation, used in countless applications where fast access to distinct strings is required. In this paper we describe a class of string hashing functions and explore its performance. In particular, using experiments with both small sets of keys and a large key set from a text database, we show that it is possible to achieve performance close to that theoreticall...
Publication date: 2015